Identifying the minimal sets of distance restraints for FRET-assisted protein structural modeling

Proteins naturally occur in crowded cellular environments and interact with other proteins, nucleic acids, and organelles. Since most previous experimental protein structure determination techniques require that proteins occur in idealized, non-physiological environments, the effects of realistic cellular environments on protein structure are largely unexplored. Recently, Förster resonance energy transfer (FRET) has been shown to be an effective experimental method for investigating protein structure in vivo. Inter-residue distances measured in vivo can be incorporated as restraints in molecular dynamics (MD) simulations to model protein structural dynamics in vivo. Since most FRET studies only obtain inter-residue separations for a small number of amino acid pairs, it is important to determine the minimum number of restraints in the MD simulations that are required to achieve a given root-mean-square deviation (RMSD) from the experimental structural ensemble. Further, what is the optimal method for selecting these inter-residue restraints? Here, we implement several methods for selecting the most important FRET pairs and determine the number of pairs Nr that are needed to induce conformational changes in proteins between two experimentally determined structures. We find that enforcing only a small fraction of restraints, Nr/N≲0.08, where N is the number of amino acids, can induce the conformational changes. These results establish the efficacy of FRET-assisted MD simulations for atomic scale structural modeling of proteins in vivo.


Introduction
Knowing the three-dimensional structure of proteins enables us to understand the biophysical mechanisms that control protein function, protein-protein interactions, and cell signaling.Nearly all protein structures that have been determined experimentally to date have been characterized under idealized, non-physiological conditions.For example, proteins have been crystallized into nonnative, solid phases for x-ray scattering experiments 1 , and proteins have been dissolved into dilute, non-physiological buffers for NMR spectroscopy or cryo-electron microscopy 2,3 .However, proteins carry out their functions in cellular environments that are significantly different from these in vitro conditions.The cellular environment is crowded with a non-solvent packing fraction of 0.3-0.4 that includes nucleic acids, carbohydrates, lipids, organelles, and other components [4][5][6] .This environment impacts the physical properties of proteins, including the protein's radius of gyration, melting temperature, and rotational diffusion coefficient [7][8][9][10] .Currently, there are over 200,000 in vitro protein structures deposited in the Protein Data Bank (PDB) 11 , as well as more than 10 6 computational models predicted by AlphaFold and RosettaFold 12,13 .To date, in vivo structures have been obtained for only three proteins (TTHA1718, GB1, and ubiquitin) using in-cell NMR [14][15][16] .In-cell NMR is not widely used since it is difficult to distinguish the isotopic labeled target protein from its environment, which leads to a low signal-to-noise ratio and sensitivity [17][18][19] .Another experimental method for solving protein structure in vivo is cryo-electron tomography 20 , however, this technique requires proteins to be confined to a thickness less than 500 nm, which limits this method to only a subset of cell types and membrane-associated proteins 21,22 .In addition, there have been numerous computational studies of proteins in vivo.All-atom molecular dynamics (MD) simulations have investigated protein structure in the presence of nearly all components of the cell cytoplasm 23,24 .However, current force fields have been calibrated to in vitro protein structures and we do not have accurate potentials for interactions between proteins and nucleic acids, ribosomes, and organelles 25,26 .Thus, we do not yet have a quantitative, atomistic-level understanding of protein structure in cells.
Förster resonance energy transfer (FRET) can be used to determine the separations between donor and acceptor chromophores attached to a pair of amino acids in a given protein by measuring their energy transfer efficiency.The distribution of separations between the two amino acids can then be deduced by calculating the configuration space volumes sampled by the donor and acceptor chromophores [27][28][29] .FRET pair labeling is highly specific and offers high accuracy for the interresidue separations with uncertainties less than 2-4 Å 30,31 .Additionally, FRET-labeled proteins can be directly expressed or injected into cells, enabling single molecule measurements of protein structure, dynamics, and stability in vivo 7,32,33 .However, most FRET experimental studies only obtain inter-residue separations for a small number of amino acid pairs in a given protein 7,10,33 .To investigate the atomic scale structure of proteins in vivo, inter-residue separations obtained from FRET experiments can be incorporated into molecular dynamics (MD) simulations.Current all-atom MD simulations of proteins using in vitro solution conditions have been shown to sample in vitro protein structures obtained from x-ray crystallography and NMR spectroscopy 34,35 .By including a sufficient number of inter-residue restraints from FRET studies of proteins in vivo into the MD force fields that were developed for in vitro solution conditions, it may be possible to sample the in vivo structural ensemble of proteins.
Several key questions must be answered before FRET-assisted structural modeling of protein structure in vivo can be employed.In particular, given the large number of possible FRET pairs, what is the minimum number of amino acid pairs for which we need distance restraints to achieve a given accuracy for the root-mean-square deviations (RMSD) between the C α positions in the restrained simulations and those in the experimental structures?However, we do not yet have access to high-resolution in vivo protein structures.Thus, we will first develop the methodology for carrying out restrained MD simulations for protein structural modeling using proteins that are found in multiple conformational states in vitro.The hypothesis is that the method that we use to induce conformational changes in vitro can also be used to study conformational changes in in vivo environments.The initial conformational state will be metastable in the MD force field without restraints, and we will induce a conformational change in the protein by incorporating restraints between amino acid pairs that are satisfied in the target state.To our knowledge, there have only been a few studies aimed at identifying the most important restraints for efficiently moving to the conformational ensemble of the target state 36 .For example, currently we do not know the minimal number of restraints and which restraints are necessary to achieve a given RMSD in the C α positions from the target state, and how random selection of given number of restraints compares to other methods for selecting restraints.To connect the in vitro results to those for in vivo conditions, we will also study one of the few proteins whose structure has been solved in vivo using in-cell NMR spectroscopy.
Here, we select four proteins that each can take on two, distinct conformational states to develop the restrained MD simulation methodology: T4 Lysozyme (172L/1L69), Phosphoglycerate Kinase (2XE6/2Y3I), Adenylate Kinase (4AKE/1AKE), and Tick Carboxypeptidase Inhibitor (1ZLI/2JTO), where the first and second PDBIDs indicate the initial and target structures, respectively.We initialize the MD simulations with the crystal structure of the initial state, add a given number of C α distance restraints, and calculate the C α RMSD relative to the target structure.For the first three proteins (T4 Lysozyme (T4L), Phosphoglycerate Kinase (PGK), and Adenylate Kinase (AK)) that are metastable in the force field, we test four methods (normal mode analysis, the largest C α separation method, the largest change in pairwise separation method, and linear discriminant analysis) for selecting the restraints and compare the C α RMSD to the target structure to that obtained from random selection of the restraints.We also vary the number of restraints to determine the minimum number of restraints needed to achieve a given C α RMSD from the target.Two of the methods (normal mode analysis and the largest C α separation method) do not use information about the target structure, whereas the other two methods (the largest change in pairwise separation method and linear discriminant analysis) compare the initial and target structures to identify the most important restraints.We find that for T4L, PGK, and AK, which take on two distinct conformational states in vitro, we can induce the conformational changes using only a small fraction of restraints, N r /N ≲ 0.02, where N is the number of amino acids in the protein.For the proteins that we considered, this result corresponds to 1-5 restraints, which is a number that can readily be achieved in FRET experiments.In addition, we studied one of the few proteins that has been characterized using in-cell NMR spectroscopy: the B1 domain of protein G (GB1).We need a slightly larger fraction of restraints, N r /N ∼ 0.08 (or 5 restraints), to change the protein conformation from the initial in vitro structure (2N9K) to the in-cell NMR structure (7QTS).Using our methods to induce conformational changes from the bound to the unbound conformations of Tick Carboxypeptidase Inhibitor (TCI) and from the in vivo to the in vitro structures of GB1, we show that the fraction of restraints required to induce conformational changes depends on the stability of the target state in the force field.In general, the largest change in pairwise separation method, which has information about the target structure, yields the lowest values for the C α RMSD for a given N r .If the restraint selection method does not have information about the target structure, the C α RMSD is still lower than that for random selection.These results establish the feasibility of FRET-assisted structural modeling of proteins in vivo using restrained MD simulations, but also emphasize the need for improved force fields for MD simulations of proteins in vivo.

Selected Proteins
We identified three proteins from the Protein Data Bank (PDB) that can be crystallized into two distinct conformational states in vitro and can be used to develop the restrained MD simulation methodology: T4 Lysozyme (T4L), Phosphoglycerate Kinase (PGK), and Adenylate Kinase (AK).(See Table 1.) [37][38][39][40][41][42] .These proteins have been studied extensively in the context of protein conformational changes 43 .While the selected targets are in vitro structures, recent studies suggest that the differences between the in vivo and in vitro structures for PGK are largely caused by the hinge motion of the two domains that occurs between the initial and target in vitro structures 8,33 .Previous studies have also suggested that the crowded cellular environment can stabilize the target in vitro structure for AK 44 .For the restrained MD simulations, we require that the initial structure is stable and the target structure is at least metastable over long time scales in MD simulations, which ensures that the unrestrained dynamics does not induce the transition from the initial state to the target.(See Fig. S1 in Supporting Information.)We also consider one protein that has both an in vitro and in-cell NMR structure, the B1 domain of protein G (GB1) 16,45 .We use the first model from the in vitro NMR bundle as the initial structure and the first model of the in-cell NMR bundle as the target structure.(See Fig. S6 in Supporting Information.)We find similar results for the restrained and unrestrained MD simulations when we initialize the system using the other NMR models.To investigate the dependence of the restraint selection method on the stability of the target structure, we study heterodimer, Tick Carboxypeptidase Inhibitor (TCI) 46,47 , which is metastable in its bound conformation, but unstable in its unbound conformation.(See Fig. S7  Table 1: We list the selected proteins including their PDBIDs for the initial and target structures, number of amino acids N , the C α RMSD (Eq.2) between the initial and target structures, and the melting temperature T m .For GB1, the C α RMSD is calculated between the first model of the in vitro NMR bundle and the first model of the in-cell NMR bundle.

C α RMSD
To compare the protein structures from the restrained MD simulations (i.e.structure S i ) and the target structure (i.e.structure S j ), we define the C α separation vector for the βth amino acid, where ⃗ r i,β is the position of the C α atom on amino acid β in structure S i .We define the rootmean-square deviation in the C α positions between two structures S i and S j as When comparing two structures S i and S j , we typically align them (i.e.rotate one of them) to achieve the minimum value of the C α RMSD(S i , S j ) for a given S i and S j .We can also calculate the C α RMSD between the restrained and target structures averaged over an ensemble of restrained structures for each amino acid β: where N s is the number of protein structures in the restrained ensemble {S i } and S j represents the target structure.

Restraint selection methods
Below, we describe the methods that we employ for selecting the C α distance restraints.Linear spring restraints will be added to MD simulations of the initial protein structure, where the rest lengths are obtained from the target structure, as discussed below.Three of the restraint selection methods, random selection, normal mode analysis, and the largest C α separation method, do not use information about the target structure.Two additional methods, the largest change in pairwise separation method and linear discriminant analysis, compare the initial and target structure to identify the most effective restraints.

a) Random selection
In a protein with N amino acids, there are N p = N (N − 1)/2 distinct amino acid pairs.As shown in Fig. 1a, the pair separations between C α atoms can be represented using a symmetric distance matrix R βδ , where β,δ = 1, . . ., N .To establish a baseline for the performance of the restrained MD simulations, we will first consider random selection of the restraints.In this approach, we exclude amino acid pairs that are too close (R βδ < R g , where R g is the radius of gyration), and pairs for which at least one amino acid is buried with relative solvent accessible surface area rSASA < 0.1 since such pairs will preclude FRET measurements 53,54 , which reduces the pool of restraints to approximately N p /3 amino acid pairs.For each number of restraints N r ≪ N p we consider, we select 100 sets of N r restraints randomly from the pool of allowed pairs.

b) Normal mode analysis
For this method, we assume that the normal modes of vibration of the initial structure provide information about how the protein transitions from the initial to the target structure.We follow the methodology used in other recent work aimed at selecting the most effective FRET pairs for restrained MD simulations of proteins 36 .First, the method constructs a coarse-grained elastic network from the atomic positions of the initial protein structure and calculates the normal modes of the network 55 .Then, the initial structure is displaced by a linear combination of the ten lowest frequency modes with amplitudes that are inversely proportional to the frequency and have proportionality constants −1 < ξ < 1 that are chosen randomly.The normal modes are then recalculated on the displaced structure and the structure is perturbed again using a linear combination of the ten lowest frequency modes with amplitudes chosen as before.This process of successive displacements along the lowest frequency normal modes is continued until 10 3 structures are obtained 56 and then repeated ten times with independently generated random mode amplitudes to yield a total of N = 10 4 structures.100 of the structures for T4L are shown in Fig. 1b.
To identify the amino acid pairs that should be restrained, we seek to minimize the RMSD of the positions of the C α atoms (Eq.2) among the protein structures in the ensemble, where the weights for each structure in the ensemble are controlled by the size of the fluctuations in the amino acid pair separations among structures.Deviations in the pair separations between two structures S i and S j can be quantified using where ∆R βδ (S i , S j ) = |R i βδ − R j βδ |, {(β, δ)} is a given set of amino acid pairs, and R i βδ is the distance between C α atoms on amino acids β and δ on structure S i .To estimate the probability of observing a mean-square deviation in the pair separations larger than χ 2 (S i , S j ), we calculate where N m is the number of amino acid pairs in the set {(β, δ)} and f (N m , χ 2 ) is the chi-squared distribution with N m degrees of freedom.To quantify the average C α RMSD over the ensemble, we calculate To minimize ⟨RMSD⟩, the structures for which the selected amino acid pairs have large deviations in their pair separations (i.e.small P (S i , S j )) should possess large C α RMSD, and the structures for which the selected amino acid pairs have small deviations (i.e.large P (S i , S j )) should possess small C α RMSD.To select the set of amino acid pairs that minimize ⟨RMSD⟩, we add one pair to the set of optimal pairs at a time.We start with identifying the single pair (β * , δ * ) that minimizes ⟨RMSD⟩ and use it as the restraint for N r = 1.We then consider two possible pairs, but fix one of the pairs to be (β * , δ * ) and find the new pair (β * * , δ * * ) in the set {(β * , δ * ), (β * * , δ * * )} that minimizes ⟨RMSD⟩.We use these two pairs as the set of restraints for N r = 2.This process continues until we have N r = 1, . . ., N max , where N max /N < 0.08 for all proteins considered.

c) Identifying the largest C α separations in the initial structures
For this method (i.e. the largest C α separation method) for selecting important restraints, we identify the amino acid pairs with the largest C α separations, or the maximum R βδ over all pairs {(β, δ)} in the initial protein structure.(For example, in Fig. 1c, we show the amino acid pair with the largest C α separation in the x-ray crystal structure for T4L (172L).)After identifying the pair (β, δ) with the largest C α separation, we find the pair (β ′ , δ ′ ) with the second largest C α separation.
This process is then continued to find a set of pairs with the largest C α separations.However, we seek to identify a minimal set of amino acid pairs without redundant structural information.Thus, we carried out unrestrained MD simulations starting from the initial structure and determined the Pearson correlation between the pairwise C α separations.In particular, we include (β ′ , δ ′ ) in the pool of selected pairs if the Pearson correlation ρ between R βδ and R β ′ δ ′ satisfies |ρ| < 0.9.We also require that the C α atoms of the new pair are not already in the pool of selected restraints.Thus, we first add the pair (β, δ) with the largest C α separation to the pool of restraints (N r = 1).We add (β ′ , δ ′ ) with the second largest C α separation to the pool of restraints (N r = 2) as long as it is uncorrelated with (β, δ) and does not include C α atom β or δ.If so, we consider the pair (β ′′ , δ ′′ ) with the next largest separation.We follow this process until we have sets of restraints with N r = 1, . . ., 5.

d) Identifying the largest change in pairwise C α separations between the initial and target structures
For this method (i.e. the largest change in pairwise separation method), we assume that the target structure is known.We identify the amino acid pair with the largest change in pairwise C α separations between the initial and target structures S i and S j (i.e. the maximum ∆R βδ (S i , S j )).
(For example, in Fig. 1d, we show the amino acid pair with the largest change in C α separations between the initial (172L) and target (1L69) x-ray crystal structures for T4L.Note that the amino acids with the largest change in pairwise separation are not the same as those with the largest C α separation.)To identify a set of non-redundant restraints, we successively implement new restraints in MD simulations for pairs with the largest deviation in the pair separation from the target.In particular, assume that pair (β, δ) has the largest ∆R βδ (S i , S j ) in the unrestrained MD simulations starting from the initial structure.We then carry out MD simulations with R βδ restrained to the target value.We identify the pair (β ′ , δ ′ ) with the largest ∆R β ′ δ ′ (S i , S j ) and carry out restrained MD restraints enforcing the target values for both R βδ and R β ′ δ ′ .We also require that the C α atoms of the new pair are not the same as any of the atoms in the current pool of restraints.We use (β, δ) as the restraint for N r = 1 and use (β, δ) with (β ′ , δ ′ ) as the set of restraints for N r = 2.
This process continues until we have N r = 1, . . ., N max , where N max /N < 0.08 for all proteins we considered.

e) Linear discriminant analysis
For the linear discriminant analysis method of selecting restraints, we also assume that the target structure is known.We seek to identify the pairwise separation between C α atoms that can serve as a collective variable enabling the protein to move from the initial to the target structure 57 .The inspiration for this method is the linear discriminant analysis method that was used to identify the collective variables for small molecule conformational changes 58 .Here, we apply the method in the context of large conformational changes in proteins.Linear discriminant analysis requires a distribution of protein structures, not a single structure.Thus, we first carried out short 20 ns unrestrained MD simulations starting from both the initial and target structures to sample an ensemble of structures near the initial and target structures.In Fig. 1e, we show these two distributions of structures for T4L as an example.In this case, we aim to find the collective variable that tracks the hinge closure motion of the two domains.Each protein conformation can be described by the set of all pairwise separations between C α atoms, where the length of the vector is the number of distinct amino acid pairs N p minus the ones that are too close together (R βδ < R g ) and the ones where at least one amino acid is buried.We can describe the distribution of protein conformations sampled near the initial and target structures as ρ A ( ⃗ X A ) and ρ A ( ⃗ X B ), respectively.We aim to find the direction Ŵ in the space of C α separations such that its projection onto the cross-correlation matrix (between initial and target distributions) is maximized, while its projection onto the covariance matrix for the initial or target distributions is minimized.To determine Ŵ , we first calculate the mean separations ⟨ ⃗ X A ⟩ and ⟨ ⃗ X B ⟩ from the distributions of the initial and target structures.Second, we define the (cross-correlation) scattering matrix which quantifies the covariance of the initial distribution with the target distribution, and define the projection of Ŵ onto S b as Ŵ T S b Ŵ .We also quantify the covariance of the initial and target distributions, separately: We can then calculate the (direct correlation) scattering matrix, with projection Ŵ T S w Ŵ .To find the direction Ŵ onto which the projections of the two distributions ρ A ( ⃗ X A ) and ρ B ( ⃗ X B ) are best separated, we can maximize the Rayleigh ratio: which is equivalent to solving for the eigenvector Ŵλ associated with the largest eigenvalue λ of We identify the most important pairwise distances as those that have the largest weights in Ŵλ .
For N r = 1, we choose the amino acid pair (β, δ) with the largest weight.For N r = 2, we choose the two amino acid pairs (β, δ) and (β ′ , δ ′ ) with the two largest weights, and so on.

Restrained MD simulations
To assess the performance of the FRET pair selection methods, we carry out all-atom MD simulations starting from the initial structure and incorporating the C α -C α separations of the selected pairs from the target structure as the equilibrium lengths of the linear spring restraints.We then monitor the RMSD of the C α atom positions (Eq.2) from the target structure as a function of time.
Unrestrained and restrained MD simulations were performed using the AMBER99SB-ILDN force field 59,60 in the GROMACS software package 61 .The MD simulations were carried out in a periodic dodecahedron-shaped box that is sufficiently large such that the protein surface is at least 20 Å from the box edges.The simulation box was solvated with water molecules modeled using TIP3P at neutral pH and 0.15M NaCl 62,63 .Short-range van der Waals and screened Coulomb interactions were truncated at 1.2 nm, while long-ranged electrostatic interactions were tabulated using the Particle Mesh Ewald summation method.The LINCS algorithm was used to constrain the bond lengths.We performed two energy minimization runs to first relax the protein and then the water molecules and the protein together using the steepest decent algorithm until the maximum net force magnitude on an atom is smaller than 500 kJmol −1 nm −1 .We perform NVT simulations of the system for 20 ns at temperature T = 300K using a velocity rescaling thermostat for sampling the canonical ensemble 64 .The equations of motion for the atomic coordinates and velocities are integrated using a leapfrog algorithm with a 1 fs time step.
For the restrained simulations, we employ a linear spring potential for each restrained amino acid pair (β, δ): where R 0 βδ is the C α -C α separation for the (β, δ) pair in the target structure and the spring constant k r = 5000 kJmol −1 nm −2 is chosen so that the averaged root-mean-square deviation between R βδ and R 0 βδ is < 0.2 Å. (See Fig. S2 in Supporting Information.)For the unrestrained MD simulations for each protein, we ran N s = 100 simulations for 20 ns starting from the initial structure, but with different initial velocities for each atom randomly selected from a Maxwell-Boltzmann distribution at T = 300K.Using these simulations, we can calculate the average C α RMSD between the initial structures and the target structure, which serves as a baseline for comparison to the results for the restrained MD simulations.For the random restraint selection method at each N r , we randomly select N s = 100 sets of N r amino acid pairs and run one restrained MD simulation (for 20 ns) for each set of restraint pairs.For all other restraint selection methods, we carry out N s = 50 simulations for each set of N r restraints, but with different random initial velocities.The number of samples N s is chosen so that average C α RMSD converges at large N s as shown in Fig. S3 in the Supporting Information.For each N r and for each restraint selection method, we average the C α RMSD over all N s simulations.For the restrained MD simulations of the in vitro protein structures, the C α RMSD converges within the first 5 ns, as shown in Fig. S4 in the Supporting Information.Therefore, we only use the data generated from 5 to 20 ns for the calculations of the C α RMSD.For the in vivo target GB1, we extended the MD simulations to 50 ns and use the time points from 20 to 50 ns to calculate the C α RMSD.(See Fig. S6c in Supporting Information.) To determine the ability of the restraints to move the protein conformation from the initial to the target structure, we need to measure a reference C α RMSD (i.e.RMSD r ) that quantifies thermal fluctuations around the target structure.For the proteins T4L, PGK, and AK, we determine RMSD r by running unrestrained MD simulations starting from the target structure.We set RMSD r = RMSD(S i , S j ), where S i is the central structure from the unrestrained MD simulations starting from the target structure, and S j is the target structure.We find the central structure by identifying the structure that has the largest number of neighboring structures in the ensemble using a cutoff C α RMSD of 1 Å.If the average C α RMSD obtained from the restrained MD simulations is below RMSD r , the restraints were successful in moving the initial structure to the target structure since the differences are comparable to thermal fluctuations observed near the target structure.For the in vitro and in-cell NMR structures of GB1 and the unbound structure of TCI, we set RMSD r = N −1 m i>j RMSD(S i , S j ), where S i and S j are distinct models in the in-cell NMR bundle and N m is the number of distinct pairs of models.Does the choice of the amino acid restraints have a significant impact on whether the initial protein structure can be moved to the target structure?As an example, we consider two possible choices for single amino acid restraints in restrained MD simulations of T4L.In Fig. 2, we plot the C α RMSD(S i , S j ) as a function of time, where S i are the structures sampled in the restrained MD simulations and S j is the target structure, for two possible choices for single amino acid restraints.Although the two restrained MD simulations start from the same initial structure (green) in Fig. 2a, the C α RMSD(S i , S j ) display different time dependence.The C α RMSD from the MD simulations using the (β ′ , δ ′ ) restraint rapidly decreases below the reference value, RMSD r , for the target structure.When we visualize the final frame from this restrained MD simulation in Fig. 2c, we find that the two subdomains of T4L are now closer together and there is strong alignment between the final frame (green) and target structure (blue).In contrast, the C α RMSD does not decrease below RMSD r for the MD simulations with the (β, δ) restraint, even though the restraint is well-satisfied during the simulations.This example emphasizes the importance of the restraint selection method.Below, we describe the performance of four restraint selection methods in moving an initial structure toward a target structure.We compare the performance of restraint selection methods that have information about the target structure and those that do not, and benchmark their performance to random restraint selection.In addition, we test the methods on moving four proteins from one in vitro structure to another, one protein from an in vitro structure to an in-cell structure, and the same protein from its in-cell structure to its in vitro structure.To quantify the performance of the four restraint selection methods in moving the initial structure to the target structure, in Fig. 3, we plot ⟨RMSD(S i , S j )⟩ S i ,Ns averaged over structures S i from the restrained MD simulations and number of samples N s , where S j is the target structure, as a function of the number of restraints N r for several restraint selection methods and three proteins.In Fig. 3a, we show the results for T4L.When N r = 0 (i.e.unrestrained MD simulations), ⟨RMSD(S i , S j )⟩ S i ,Ns ∼ 4 Å and the structures sampled in the MD simulations remain far from the target structure.When randomly selecting restraints, ⟨RMSD(S i , S j )⟩ S i ,Ns decreases slowly with increasing N r and does not reach RMSD r even after five restraints have been included.In contrast, if we use structural information about the target to select the restraints (i.e. using linear discriminant analysis or identifying the largest change in pairwise C α separations between the initial and target structures), ⟨RMSD(S i , S j )⟩ S i ,Ns rapidly decreases to RMSD r after adding only one restraint.The normal mode analysis and largest C α separation methods, which do not use information about the target, achieve intermediate results; ⟨RMSD(S i , S j )⟩ S i ,Ns decreases to RMSD r within N r = 2-3 restraints.Thus, both of these methods (normal mode analysis and the largest C α separation method) are more efficient than random selection for moving the initial structure to the target for T4L.

Performance of restraint selection methods
To investigate the performance of restraint selection methods in moving an initial structure to the target structure for a wider range of proteins, we performed unrestrained and restrained MD simulations on two larger proteins, PGK (Fig. 3b) and AK (Fig. 3c), and find similar results for the variation of ⟨RMSD(S i , S j )⟩ S i ,Ns with N r .(Note that for AK, we needed to add a background of intra-domain restraints to recapitulate the B-factor in the "unrestrained" MD simulations as shown in Fig. S5 in Supporting Information.)For example, for random restraint selection, ⟨RMSD(S i , S j )⟩ S i ,Ns slowly decreases with increasing N r , and ⟨RMSD(S i , S j )⟩ S i ,Ns > RMSD r even for N r = 5.For PGK and AK, all restraint selection methods (except linear discriminant analysis) outperform random restraint selection.Normal mode analysis and the largest C α separation method, which do not include information about the target, achieve lower values of ⟨RMSD(S i , S j )⟩ S i ,Ns than that for random selection, but their relative performance depends on the protein.For example, ⟨RMSD(S i , S j )⟩ S i ,Ns decreases faster with N r for the largest C α separation method for PGK, whereas ⟨RMSD(S i , S j )⟩ S i ,Ns decreases faster for the normal mode analysis method for AK.As expected, the method of identifying the largest change in pairwise C α separations between the initial and target structures is the best performing method with ⟨RMSD(S i , S j )⟩ S i ,Ns ∼ RMSD r after only a small number of restraints (5 for PGK and 2 for AK).The performance of the restrained MD simulations based on the linear discriminant analysis method of selecting restraints is comparable to that for random restraint selection even though it incorporates structural information about the target.This result likely stems from the fact that a nonlinear method is needed to identify the collective variables in proteins.These results demonstrate that the addition of a small number of pairwise distance restraints in restrained MD simulations can effectively move a protein from an initial structure to a target structure that is more than 4 Å away.However, the minimal number of restraints required for ⟨RMSD(S i , S j )⟩ S i ,Ns ∼ RMSD r depends on the protein.What controls the minimum number of restraints?To address this question, we calculate the normalized C α RMSD:

Minimum fraction of restraints
where ⟨RMSD(S i , S j )⟩ S i ,Ns (0) is ⟨RMSD(S i , S j )⟩ S i ,Ns from unrestrained MD simulations with N r = 0. Thus, RMSD(N r ) = 1 indicates that the protein samples the ensemble of initial structures and RMSD(N r ) = 0 indicates that the protein samples the target ensemble.In Fig. 4, we show that RMSD(N r ) collapses for the three proteins T4L, PGK and AK when plotted versus N r /N and using the optimal restraint selection method (i.e. the largest change in pairwise separation).In this case, only a small fraction of restraints, less than 1.5% of the protein size, is required to move between the two in vitro protein structures.For the normal mode analysis restraint selection method, which does not consider information about the target, the collapse of RMSD with N r /N is not as complete, but RMSD ∼ 0 for N r /N ≳ 0.02 for all proteins.Thus, even in the absence of complete knowledge of the target structure, we can induce changes in protein structure from an initial in vitro structure to a target in vitro structure using only a small fraction of amino acid separation restraints.
Moving from initial in vitro structure to target in-cell structure for GB1 To investigate whether the proposed methodology can also move an initial in vitro structure to a target in vivo structure, we also carried out unrestrained and restrained MD simulations for the B1 domain of protein G (GB1) 16 .We find that inter-bundle ⟨RMSD(S i , S j )⟩ S i ,S j ∼ 3.0 Å between the in vitro NMR models S i and the in-cell NMR models S j is rather small, mostly due to small changes in the two top loop regions shown in Fig. 5a.Despite this, the C α RMSD between the in vitro and in-cell NMR bundles is larger than both of the intra-bundle fluctuations.Can we use similar restraint selection methods to those above to move from an in vitro structure to an in-cell structure?In contrast to the in vitro target structures for T4L, PGK, and AK, the in-cell target structure for GB1 is not a local minimum of the AMBER99SB-ILDN force field.(See Figs.S8a and b in the Supporting Information.)As a result, simulating in-cell protein structures is more challenging.In Fig. 5b, we implement the largest change in pairwise separation method for identifying restraints and calculate ⟨RMSD(S i , S j )⟩ S i ,Ns as a function of N r .We find that the target in-cell structures are sampled for N r ≥ 5 in restrained MD simulations, which corresponds to N r /N ∼ 8%.Unlike for the in vitro structures, where ⟨RMSD(S i , S j )⟩ S i ,Ns monotonically decreases as N r increases, ⟨RMSD(S i , S j )⟩ S i ,Ns exhibits non-monotonic behavior with N r for GB1.To investigate this effect, we calculate the C α RMSD between the structures from the MD simulations and the target for each amino acid β individually (Eq.3).As shown in Fig. 5c, the deviations in the structures sampled in the unrestrained simulations and target structure occur predominantly in the top two loop regions in Fig. 5a with 21 ≤ β ≤ 26 and 47 ≤ β ≤ 52.The deviations between the initial and target structures can be decreased in the loop regions to ≲ 2 Å for N r = 5.However, the structural deviations at amino acid positions adjacent to the loop regions, for example, 17 ≤ β ≤ 20 and 27 ≤ β ≤ 31, begin to increase as N r increases.These results suggest that the restraints are competing with the force field when attempting to move the protein to the in-cell structure.The non-monotonic behavior in ⟨RMSD(S i , S j )⟩ S i ,Ns gives rise to a higher fraction of restraints that are necessary to move GB1 from the initial in vitro structure to the in-cell structure.Several studies have suggested that GB1 is a single-domain protein, not a two-domain protein as for T4L, PGK, and AK 65 .It is possible that the larger fraction of restraints needed to induce a conformational change in GB1 (relative to the fraction needed for T4L, PGK, and AK) is related to the fact that it behaves like a single-domain protein.However, the variation in the minimum fraction of restraints can also be related to the stability of the target structure in the force field without restraints.To investigate this effect, we calculated the minimum number of restraints needed to move the "unstable" in vivo structure of GB1 to its "metastable" in vitro structure.We carried out restrained MD simulations of GB1 using restraints selected from the largest change in pairwise separation method.We find that only ∼ 3% of the restraints are needed to move from the in vivo to the in vitro structure for GB1 (Fig. 6a), whereas ∼ 8% of the restraints are needed to move from the in vitro to the in vivo structure.We also studied another two-domain protein Tick Carboxypeptidase Inhibitor (TCI) that undergoes a conformational change upon binding to its receptor.(The bound and unbound structures have PDBIDs 1ZLI and 2JTO, respectively.)The unbound NMR structure is unstable in the AMBER99SB-ILDN force field as shown in Fig. S7.We find that the fraction of restraints needed to move from the initial bound state to the target unbound state of TCI is significantly higher (∼ 7%) than the fraction of restraints needed to move from the unstable in vivo state to the stable in vitro state of GB1, even though TCI is a two-domain protein.(See Fig. 6b.)Therefore, the characteristic fraction of restraints needed to induce conformational changes is primarily determined by the stability of the target state in the force field (without restraints).

Discussion
Numerous studies have emphasized that the in-cell environment can strongly influence protein structure and interactions 4,6,7,9,66 .However, the complete atomic structure for proteins in the cellular environment has been obtained for only three proteins using NMR spectroscopy [14][15][16] .FRET has also been employed to determine a small number of separations between amino acids in several proteins in vivo 7,10,33 .In contrast, MD simulations, which have been calibrated to structures in the protein data bank, can provide the atomic coordinates of proteins in idealized, in vitro conditions.Properly calibrated force fields that would allow accurate all-atom descriptions of protein conformations in vivo currently do not exist.For example, GTT WW domain and threehelix bundle protein B (PB) fail to refold in all-atom cytoplasm computational models 23,67 , since the interatomic sticking interactions in current protein force fields increase the stability of nonnative states in all-atom cytoplasm models 68,69 .The restrained MD simulations of proteins in vivo (using residue separations measured in FRET experiments) can be carried out using current force fields, and thus we do not need to wait for improvements to the force fields in all-atom models of the cytoplasm.
While current all-atom MD simulations allow us to sample in vitro protein structures that match the x-ray crystal or NMR structures, we want to capture the all-atom conformational dynamics of in vivo protein structures.To do this, we can start with an initial in vitro protein structure, and add inter-residue distances measured by FRET as restraints in the MD simulations.However, an important question is how do we determine which amino acid pairs should be measured in the FRET experiments and then incorporated into the restrained MD simulations?In particular, what is the minimal number of restraints needed to achieve a given C α RMSD to the target structure and what is the optimal method to select these restraints?To answer these questions, we first develop the methodology of restraint selection for proteins with multiple conformational states in vitro.To connect the in vitro results to those for in vivo conditions, we studied GB1, which is one of the few proteins whose structure has been solved in vivo using in-cell NMR spectroscopy.We selected the in vitro structure as the initial structure and the in vivo structure as the target structure.We employed four methods for selecting the restraints, varied the number N r and type of restraints that are incorporated into the restrained MD simulations, and compared the performance (i.e.C α RMSD relative to the target structure) of the restraint selection methods to random selection.
We found that for T4L, PGK and AK, which possess two distinct conformational states that have been solved in vitro and are both stable in the force field without restraints, it only takes a small fraction of restraints (N r /N < 2%) to induce the conformational changes.The results emphasize that only a limited amount of information about pairwise distances between amino acids is needed to induce protein conformational changes from an initial structure to a target structure.The largest change in pairwise separation method, which uses the target structure information, gives the lowest values of C α RMSD at each N r .The normalized RMSD(N r ) for the largest change in pairwise separation method collapses for these three proteins when plotted versus N r /N , and reaches zero at a small fraction of restraints (N r /N < 1.5%).The linear discriminant analysis, which also uses the target structure information, exhibits inconsistent performance, sometimes comparable to and sometimes better than random selection.The normal mode analysis and the largest C α separation method, which do not need target structure information, give lower values of C α RMSD compared to random selection.The performance of these two methods varies slightly for different proteins.Specifically, for the normal mode analysis, RMSD(N r ) reaches zero when N r /N > 2%, which is significantly lower than the fraction of restraints used in previous studies of FRET-assisted protein structural modeling (between 5% and 12%) 36 .This result is significant since it shows that it is possible to determine the complete protein structure, using information from only ∼ 2% of the C α separations.
The studies described here provide proof of principle that FRET-assisted MD simulations can improve our understanding of the atomistic structure of proteins in vivo since only a small fraction of restraints are required to lock-in the target structures.We have shown that the characteristic fraction of restraints needed to induce conformational changes is determined by the stability of the target state in the force field without restraints.Since the in vivo structures are not stable minima of the current MD force fields, additional restraints are necessary to reach a given C α RMSD and the dependence of the C α RMSD on N r is non-monotonic.Therefore, an important future direction is to develop force fields for which the in vivo structures are potential energy minima.However, this goal requires a large number of high-quality, all-atom in vivo protein structures.In the absence of many in vivo protein structures, we can also develop in-cell mimetic systems whose crowding and surface sticking interactions have been calibrated to the in vivo studies 8,33 .
The linear discriminant analysis approach of using the direction that maximizes the differences in the initial and final state projections has been used successfully to identify the key pairwise separations that distinguish the initial and target states of small molecules 58 .However, while we find that linear discriminant analysis outperforms random selection only for T4L (N = 162 amino acids), it performs worse than random selection for PGK (N = 413) and AK (N = 214).This result suggests that as the number of amino acids increases, the dimensionality of the conformation space increases to such an extent that the linear method cannot capture the key pairwise distances that describe the structural transition.Therefore, nonlinear methods, such as the nudged elastic band method, are needed to calculate the minimum energy pathway between the initial and final states and identify the most important pairwise distances along the pathway 70,71 .
The largest change in pairwise separation method that identifies a minimal set of amino acid pairs without redundant structural information gives the lowest values of C α RMSD to the target structures.Note that one should not imagine a set of unique, optimal restraints, but an optimal pool of similar restraints that can induce structural changes (see Fig. S8 in Supporting Information).However, this method requires information about both the initial and target structures, and thus it cannot be used to predict an in vivo structural ensemble that has not yet been determined.Given that there are several NMR structures for proteins in cellular environments available, this selection method can be used in restrained MD simulations to verify that the simulations can correctly sample the conformational dynamics for these proteins in vivo.In contrast, if we aim to predict the in vivo structure of proteins, we have shown that normal mode analysis, which assumes that the normal modes of vibration of the initial structure provide information about how the protein transitions from the initial structure to the target structure, can be used to select the FRET-pair labeling positions that will enable the restrained MD simulations to sample the target in vivo structure.In this work, which considers large-scale domain motion in proteins, we showed that the normal mode analysis is a successful restraint selection method.However, it is not yet clear whether such motion is relevant for proteins in vivo.Thus, if the normal mode analysis is not efficient in inducing a change from the initial to the target in vivo structure, the largest C α separation method, or a hybrid method that couples normal mode analysis and the largest C α separation method, can also be implemented.(See Fig. S9 in Supporting Information.) It is important to note that there are several experimental limitations for selecting amino acid pairs for FRET labeling.For example, the labeled dye molecules should not perturb the protein structure.Thus, in this study, we did not select amino acid pairs that occur within the core.FRETlabeling core amino acids would likely lead to changes in protein structure.In addition, optimal efficiency for FRET-labeling is achieved when the distance d between donor and acceptor molecules is comparable to the characteristic Förster distance R 0 for each dye pair.For most common FRET chromophores, 50 Å < R 0 < 60 Å, which is larger than R g for the proteins we considered in this work.Thus, we excluded amino acid pairs that are separated by d < R g .
As mentioned above, we showed in this study that the normal mode analysis selection method (and largest C α separation method) can identify a small number of important C α -C α distance restraints that can effectively move an initial in vitro structure to a target in vivo structure.However, since the FRET energy transfer efficiency reports on the ensemble of distances between dye molecules (not C α -C α distances), quantitative FRET-assisted protein structural modeling requires the mapping between dye-dye distances and C α -C α distances 29 .Therefore, in future studies, we will identify the mapping by incorporating atomic-scale modeling of the dye molecules into the restrained MD simulations.
Finally, our current analysis of restraint selection methods focuses on monomeric, globular proteins.In future studies, we can expand the application of our restraint selection methods to intrinsically disordered proteins (see Fig. S10 in Supporting Information), membrane proteins, protein-protein and protein-nucleic acid complexes.Another interesting future direction is to develop restraint selection methods for triple restraints (co-restraints between 3 residues), using the inter-residue distances obtained from a 3-color FRET experiments 72 .

Figure 1 :
Figure 1: Illustration of several distance restraint selection methods, using T4L as the example protein.(a) The symmetric distance matrix R βδ , where the color gradient from dark to light indicates increasing C α separations between amino acids β and δ.(b) For the normal mode analysis method, we generate an ensemble of 10 4 structures (100 structures are shown in green) that have been displaced from the initial structure along a random superposition of normal modes corresponding to the ten lowest frequencies.A single amino acid pair that minimizes the C α RMSD among structures in the ensemble is highlighted in red and orange in each structure.(c) We identify the amino acid pairs with the largest C α separations R βδ in the initial (green) structure.The pair with the maximum R βδ is highlighted in red and orange.(d) We can also select restraints by identifying the largest changes in the pairwise C α separations between the initial (green) and target (blue) structures.The C α atoms of the pair with the maximum ∆R βδ (S i , S j ) are shown in red and orange.(e) 20 conformations from unrestrained MD simulations starting from the initial and target structures are shown in green and blue, respectively.Using linear discriminant analysis, we identify the collective variable Ŵ that maximizes the cross-correlation of the two ensembles.

Figure 2 :
Figure 2: (a) The top and bottom green images indicate the same starting structures for restrained MD simulations that enforce two different single restraints: R βδ = R 0 βδ (red) and R β ′ ,δ ′ = R 0 β ′ ,δ ′ (orange).The target structures, which are aligned with the starting structures such that the C α RMSD values are minimized for residues 92 to 162, are shown in blue.(b) RMSD(S i , S j ) as a function of time during restrained MD simulations for the restraints in (a), where the S i are structures sampled at each time during the restrained MD simulations and S j is the target structure.The black dashed line indicates RMSD r of the target.(c) The final frames at 20 ns from the restrained MD simulations with R βδ = R 0 βδ (top, red) and R β ′ ,δ ′ = R 0 β ′ ,δ ′ (bottom, orange).The structures from the final frames of the restrained MD simulations (green) are aligned with the target structure (blue) using the same alignment as in (a).

Figure 3 :
Figure 3: The C α RMSD, ⟨RMSD(S i , S j )⟩ S i ,Ns , averaged over structures S i from the restrained MD simulations and number of samples N s , where S j is the target structure, is plotted versus the number of restraints N r for (a) T4L, (b) PGK, and (c) AK.We show results for several restraint selection methods: random selection (blue circles), normal mode analysis (orange squares), the largest C α separation method (green crosses), the largest change in pairwise separation method (red upward triangles), and linear discriminant analysis (purple stars).The horizontal black dashed line indicates RMSD r for each target.Snapshots of the initial structures are shown for each protein in the upper right corner of each panel.The error bars give the standard error of ⟨RMSD(S i , S j )⟩ S i from N s independent simulations.

Figure 4 :
Figure 4: The normalized C α RMSD, RMSD(N r ), is plotted versus N r /N for T4L (blue circles), PGK (green triangles), and AK (red squares) using (a) the largest change in pairwise C α separation method and (b) the largest C α separation method for selecting the restraints.

Figure 5 :
Figure 5: (a) A ribbon diagram of the initial in vitro structure (green) of GB1 aligned with its in-cell target structure (blue).The gray arrows indicate the structural changes from the initial structure to the target structure in the two top loop regions.(b) ⟨RMSD(S i , S j )⟩ S i ,Ns (N r ) plotted as a function of N r /N for restrained MD simulations of GB1 using the largest change in pairwise C α separation between the initial and target structures method (red upward triangles) for selecting restraints.The horizontal black dashed line indicates RMSD r for the target.(c) RMSD({S i }, S j , β) (Eq. 3) is plotted versus amino acid index β, where {S i } are structures sampled in the restrained MD simulations and S j is the target structure.The color from dark blue to yellow indicates N r .The two top loop regions in (a) with 21 ≤ β ≤ 26 and 47 ≤ β ≤ 52 are highlighted in light gray.

Figure 6 :
Figure 6: ⟨RMSD(S i , S j )⟩ S i ,Ns plotted as a function of N r /N for restrained MD simulations of (a) GB1 and (b) TCI using the largest change in pairwise C α separation between the initial and target structures method (red upward triangles) for selecting restraints.The horizontal black dashed lines indicate RMSD r for the targets.Ribbon diagrams of the initial structures (blue) aligned with the target structures (green) for GB1 and TCI are shown in the upper right corners of (a) and (b), respectively.